Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

نویسندگان

Sudhakar Singh

Rakhi Garg

P. K. Mishra

چکیده

Mining frequent itemsets from massive datasets is always being a most important problem of data mining. Apriori is the most popular and simplest algorithm for frequent itemset mining. To enhance the efficiency and scalability of Apriori, a number of algorithms have been proposed addressing the design of efficient data structures, minimizing database scan and parallel and distributed processing. MapReduce is the emerging parallel and distributed technology to process big datasets on Hadoop Cluster. To mine big datasets it is essential to re-design the data mining algorithm on this new paradigm. In this paper, we implement three variations of Apriori algorithm using data structures hash tree, trie and hash table trie i.e. trie with hash technique on MapReduce paradigm. We emphasize and investigate the significance of these three data structures for Apriori algorithm on Hadoop cluster, which has not been given attention yet. Experiments are carried out on both real life and synthetic datasets which shows that hash table trie data structures performs far better than trie and hash tree in terms of execution time. Moreover the performance in case of hash tree becomes worst.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

Hadoop MapReduce framework is an important distributed processing model for large-scale data intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and a same workload is assigned to each node. Default Hadoop d...

متن کامل

Performance Evaluation of Apriori Algorithm on a Hadoop Cluster

Frequent Itemset Mining is a well-known concept in data sciences. If we feed frequent itemset miner algorithms with large datasets they become resource hungry fast as their search space explodes. This problem is even more apparent when we try to use them on Big Data. Recent advances in parallel programming provides good solutions to deal with large datasets but they present their own problems w...

متن کامل

Performance optimization of MapRe duce-base d Apriori algorithm on Hadoop cluster

Many techniques have been proposed to implement the Apriori algorithm on MapReduce framework but only a few have focused on performance improvement. FPC (Fixed Passes Combined-counting) and DPC (Dynamic Passes Combined-counting) algorithms combine multiple passes of Apriori in a single MapReduce phase to reduce the execution time. In this paper, we propose improved MapReduce based Apriori algor...

متن کامل

Mining Frequent Item Sets Using Map Reduce Paradigm

In Text categorization techniques like Text classification or clustering, finding frequent item sets is an acquainted method in the current research trends. Even though finding frequent item sets using Apriori algorithm is a widespread method, later DHP, partitioning, sampling, DIC, Eclat, FP-growth, H-mine algorithms were shown better performance than Apriori in standalone systems. In real sce...

متن کامل

Map / Reduce Deisgn and Implementation of Apriori Alogirthm for handling voluminous data-sets

Apriori is one of the key algorithms to generate frequent itemsets. Analysing frequent itemset is a crucial step in analysing structured data and in finding association relationship between items. This stands as an elementary foundation to supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understand the behaviour of structure...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1511.07017 شماره

صفحات -

تاریخ انتشار 2015

Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

نویسندگان

چکیده

منابع مشابه

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

Performance Evaluation of Apriori Algorithm on a Hadoop Cluster

Performance optimization of MapRe duce-base d Apriori algorithm on Hadoop cluster

Mining Frequent Item Sets Using Map Reduce Paradigm

Map / Reduce Deisgn and Implementation of Apriori Alogirthm for handling voluminous data-sets

عنوان ژورنال:

اشتراک گذاری